Red Wine Data: This dataset contains 1599 observations of 13 variables associated
with the attributes of the red variant of a Portuguese wine, Vinho Verde. I will
use this data to see which attributes are most common among “high” quality
Vinho Verde wines. To do so, I am going to create three categories of wine based
on their quality rating: High, Medium, and Low.
Let’s take a look at the new structure of our dataset after converting quality
to an ordered factor and adding the new variable “rating”, which groups
quality into three buckets. We will also take a look at the distribution of the
new variable with a histogram.
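The bucketing described here could be sketched in base R as follows. This is a minimal illustration on a small made-up data frame, not the original code; the cut points (3-4 Low, 5-6 Medium, 7-8 High) are inferred from the level counts in the summary output, and the name `toy` is hypothetical.

```r
# Hypothetical toy stand-in for the wine data frame (the real one has 1599 rows).
toy <- data.frame(quality = c(3, 4, 5, 6, 7, 8, 5, 6))

# Bucket quality into three ordered groups: (2,4] Low, (4,6] Medium, (6,8] High.
toy$rating <- cut(toy$quality,
                  breaks = c(2, 4, 6, 8),
                  labels = c("Low", "Medium", "High"),
                  ordered_result = TRUE)

# Convert quality itself to an ordered factor.
toy$quality <- factor(toy$quality, ordered = TRUE)

str(toy)
table(toy$rating)
```

Using `cut()` with `ordered_result = TRUE` keeps the Low < Medium < High ordering, which is what lets `str()` report the column as an `Ord.factor`.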
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## $ rating : Ord.factor w/ 3 levels "Low"<"Medium"<..: 2 2 2 2 2 2 2 3 3 2 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality rating
## Min. : 8.40 3: 10 Low : 63
## 1st Qu.: 9.50 4: 53 Medium:1319
## Median :10.20 5:681 High : 217
## Mean :10.42 6:638
## 3rd Qu.:11.10 7:199
## Max. :14.90 8: 18
In the dataset summary we can see the addition of the new variable “rating” and
the associated counts at each level. We can also see the new summary of quality:
since it has been transformed to a factor, its counts are summarized as well.
The histogram shows the counts of each rating, and the distribution is
interesting. There is a very large number of medium rated wines (1319),
relatively few highly rated wines (217), and even fewer on the low end (63). At
this point I would have to question the manual quality assignment, the sample
collection method, or both.
Distribution: From the distribution of the ratings above, we can see that most
of the wines fall in the Medium rating category, meaning they
have a quality value of 5 or 6. I am not sure why that is; it could be due to
the data collection method, as this may not be a random sample. To begin with, I
would like to take a look at the distribution of all the other variables, paying
particular attention to a few variables that, at first thought, I think
would have an impact on quality. These include Alcohol, pH, Sulphates, and
Density.
Data Structure: The dataset on red wines consists of 1599 rows
(observations) of data. Each observation contains 13 columns (variables) of
attributes of the wine. The primary categorical variable of this dataset is
quality; the remaining variables are continuous and describe the
chemical and physical properties of the wine. I noticed the following
distributions in the histograms:
- Normal: Quality, pH, Density, Volatile Acidity
- Positively Skewed: Alcohol, Citric Acid, Fixed Acidity, Free Sulfur
Dioxide, Total Sulfur Dioxide, Sulphates
- Long Tail: Chlorides, Residual Sugar
Transform the Data: I’d like to take another look at some of the data
that may be overdispersed. In particular, I want to take another look at the
long-tailed data for Chlorides and Residual Sugar, as well as the positively
skewed Sulphates data, using both log base 10 and square root transformations.
It appears both the square root and log transformations bring the data a
little closer to a normal distribution; however, there are still several
outliers.
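To make the effect of the two transformations concrete, here is a small sketch that compares sample skewness before and after each one. It uses synthetic right-skewed data rather than the wine data, and the `skewness` helper is written out by hand here (it is an assumption for illustration, not a function used in the original analysis).

```r
set.seed(42)  # reproducible synthetic, right-skewed data (not the wine data)
x <- rlnorm(1000, meanlog = -0.5, sdlog = 0.5)

# Sample skewness: the third standardized moment.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skews <- c(raw   = skewness(x),
           sqrt  = skewness(sqrt(x)),
           log10 = skewness(log10(x)))
print(round(skews, 2))
```

A skewness closer to zero indicates a more symmetric distribution; for data like this the square root reduces the skew and the log reduces it further, though neither removes genuine outliers.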
Let’s take a look at the relationship between quality and the four variables I
predicted would have the greatest effect on wine quality: Alcohol, pH,
Sulphates, and Density.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## [1] 0.4761663
This plot was originally overplotted, so I added an alpha of 0.3 to get a
better view of the data points. We can see in this distribution there is a
positive correlation (about 0.48) between higher alcohol content and a higher quality rating.
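The correlation value printed above can be computed with base R's `cor()`. The sketch below uses a made-up data frame (the values are illustrative only) and also shows one subtlety: `as.numeric()` on an ordered factor returns level indices rather than the original scores. For Pearson correlation this makes no difference, since the two differ only by a constant shift.

```r
# Hypothetical toy data; the real analysis uses the 1599-row wine data frame.
toy <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.0, 9.4, 9.5, 10.1, 10.4, 11.0, 11.8, 12.5)
)
toy$quality_f <- factor(toy$quality, ordered = TRUE)

# as.numeric() on a factor returns level indices (1, 2, ...), not the labels.
codes  <- as.numeric(toy$quality_f)
scores <- as.numeric(as.character(toy$quality_f))

r_codes  <- cor(codes, toy$alcohol)
r_scores <- cor(scores, toy$alcohol)
print(c(r_codes, r_scores))
```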
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## [1] -0.05773139
There is little visual evidence of a correlation between pH and quality. If
a correlation exists it appears to be a minor negative correlation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## [1] -0.1749192
Again, as with pH, we are seeing very little correlation with quality here, and
what we do see is a negative correlation, meaning higher quality wines are less
dense. This may make some sense given that higher quality wines tend to have
a higher alcohol concentration.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4815 -0.2596 -0.2076 -0.1934 -0.1367 0.3010
## [1] 0.3086419
As with alcohol, we are seeing a positive correlation between sulphates (on a
log base 10 scale) and quality. There seem to be several outliers in the quality
ranges of 5 and 6, but overall the correlation looks moderate at about 0.31.
From the above scatterplots we can see there appears to be a notable
correlation between the percent alcohol by volume and quality, and somewhat of a
correlation between sulphates and quality. pH and Density seem to have little to
no correlation with the quality of the wine.
Given this information, I will create a correlation table between all of the
variables to see what other correlations may exist, in order to determine what
further analysis should be performed.
##
## ---------------------------------------------------------------------------
## fixed.acidity volatile.acidity citric.acid
## -------------------------- --------------- ------------------ -------------
## **fixed.acidity** 1 -0.2561 **0.6717**
##
## **volatile.acidity** -0.2561 1 **-0.5525**
##
## **citric.acid** **0.6717** **-0.5525** 1
##
## **residual.sugar** 0.1148 0.001918 0.1436
##
## **chlorides** 0.09371 0.0613 0.2038
##
## **free.sulfur.dioxide** -0.1538 -0.0105 -0.06098
##
## **total.sulfur.dioxide** -0.1132 0.07647 0.03553
##
## **density** **0.668** 0.02203 **0.3649**
##
## **pH** **-0.683** 0.2349 **-0.5419**
##
## **sulphates** 0.183 -0.261 **0.3128**
##
## **alcohol** -0.06167 -0.2023 0.1099
##
## **quality** 0.1241 **-0.3906** 0.2264
## ---------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## ------------------------------------------------------------------------------
## residual.sugar chlorides free.sulfur.dioxide
## -------------------------- ---------------- ------------ ---------------------
## **fixed.acidity** 0.1148 0.09371 -0.1538
##
## **volatile.acidity** 0.001918 0.0613 -0.0105
##
## **citric.acid** 0.1436 0.2038 -0.06098
##
## **residual.sugar** 1 0.05561 0.187
##
## **chlorides** 0.05561 1 0.005562
##
## **free.sulfur.dioxide** 0.187 0.005562 1
##
## **total.sulfur.dioxide** 0.203 0.0474 **0.6677**
##
## **density** **0.3553** 0.2006 -0.02195
##
## **pH** -0.08565 -0.265 0.07038
##
## **sulphates** 0.005527 **0.3713** 0.05166
##
## **alcohol** 0.04208 -0.2211 -0.06941
##
## **quality** 0.01373 -0.1289 -0.05066
## ------------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------------------------------------
## total.sulfur.dioxide density pH
## -------------------------- ---------------------- ------------- -------------
## **fixed.acidity** -0.1132 **0.668** **-0.683**
##
## **volatile.acidity** 0.07647 0.02203 0.2349
##
## **citric.acid** 0.03553 **0.3649** **-0.5419**
##
## **residual.sugar** 0.203 **0.3553** -0.08565
##
## **chlorides** 0.0474 0.2006 -0.265
##
## **free.sulfur.dioxide** **0.6677** -0.02195 0.07038
##
## **total.sulfur.dioxide** 1 0.07127 -0.06649
##
## **density** 0.07127 1 **-0.3417**
##
## **pH** -0.06649 **-0.3417** 1
##
## **sulphates** 0.04295 0.1485 -0.1966
##
## **alcohol** -0.2057 **-0.4962** 0.2056
##
## **quality** -0.1851 -0.1749 -0.05773
## -----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------------------------
## sulphates alcohol quality
## -------------------------- ------------ ------------- -------------
## **fixed.acidity** 0.183 -0.06167 0.1241
##
## **volatile.acidity** -0.261 -0.2023 **-0.3906**
##
## **citric.acid** **0.3128** 0.1099 0.2264
##
## **residual.sugar** 0.005527 0.04208 0.01373
##
## **chlorides** **0.3713** -0.2211 -0.1289
##
## **free.sulfur.dioxide** 0.05166 -0.06941 -0.05066
##
## **total.sulfur.dioxide** 0.04295 -0.2057 -0.1851
##
## **density** 0.1485 **-0.4962** -0.1749
##
## **pH** -0.1966 0.2056 -0.05773
##
## **sulphates** 1 0.09359 0.2514
##
## **alcohol** 0.09359 1 **0.4762**
##
## **quality** 0.2514 **0.4762** 1
## -------------------------------------------------------------------
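A table like the one above can be built from base R's `cor()`. The following is a rough sketch on synthetic columns, not the original code; the 0.3 cutoff for flagging pairs is an assumption inferred from which cells appear in bold, and the column relationships in `toy` are invented for illustration.

```r
set.seed(1)  # synthetic stand-in for a few numeric wine columns
n <- 200
alcohol <- rnorm(n, mean = 10.4, sd = 1)
density <- 1.0 - 0.002 * alcohol + rnorm(n, sd = 0.002)
pH      <- rnorm(n, mean = 3.3, sd = 0.15)
toy <- data.frame(alcohol, density, pH)

# Pairwise Pearson correlations, rounded for display.
cm <- round(cor(toy), 4)
print(cm)

# List each off-diagonal pair once where |r| exceeds the assumed 0.3 cutoff.
strong <- which(abs(cm) > 0.3 & upper.tri(cm), arr.ind = TRUE)
print(data.frame(var1 = rownames(cm)[strong[, "row"]],
                 var2 = colnames(cm)[strong[, "col"]],
                 r    = cm[strong]))
```

Restricting the search to `upper.tri(cm)` avoids listing every pair twice, since the matrix is symmetric.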
According to the correlation table, in addition to alcohol and
sulphates, citric.acid and volatile.acidity have the highest correlations with
wine quality. We can take a look at these correlations in visual form below:
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## [1] 0.2263725
According to our correlation table, quality has a correlation value of 0.2264
with citric acid. This is a modest positive correlation and can be seen in the
above plot. According to the provided dataset notes, citric acid adds a taste of
freshness to the wine, so it would make sense to see higher rated wines having
higher levels of citric acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## [1] -0.3905578
Interestingly enough, we see the opposite of citric acid in volatile acidity,
which, again according to the provided notes, has the effect of making a wine
taste unpleasant. So here the correlation with quality is negative (-0.39),
meaning higher rated wines have lower volatile acid concentrations.
Below are some additional variable pairs with significant levels of
correlation to each other.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## [1] -0.4961798
I mentioned earlier, on the plot for density and quality, that we might see a
correlation between density and alcohol. While visually density did not show much
of an impact on quality, it does in fact have a negative correlation at -0.1749.
And as you can see above, density and alcohol have a negative correlation of
-0.4962.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## [1] 0.6680473
Density and fixed acidity have one of the strongest correlations in the dataset
at 0.668.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## [1] 0.3552834
Density and residual sugar are correlated at 0.3553. If we were going to
investigate this further it may make sense to get a better visual by removing
the outliers and decreasing the breaks on the x axis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] 0.6717034
Fixed acidity and citric acid are correlated at 0.6717. This probably should
not be a surprise as an increase in citric acid would naturally add to the
acidity of the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] -0.6829782
pH and fixed acidity are the most strongly correlated pair of variables in the
dataset at -0.683. Again, I’m not sure that this adds much to the analysis, as
one would expect to see pH decrease at higher levels of acid.
We know from our initial observations that alcohol and volatile acidity
contribute significantly to the quality of a red wine. Let’s look at combining a
few of these variables to see what that looks like.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
We can see from looking at the above plot that higher quality wines typically
have a higher concentration of alcohol while at the same time having a lower
level of volatile acidity.
We can now add sulphates to the plot to see its impact on the quality of a
red wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Again we see higher quality wines typically have higher alcohol content, lower
volatile acidity, and higher sulphates (hue).
Additionally, we can look at the impact of citric acid and sulphates on the
wine rating; see below.
As shown above, wines with higher ratings typically have
higher levels of both citric acid and sulphates. We will use these
variables, along with a few other variables with high correlation, to create a
linear model below and see if we can predict what makes up a high quality wine.
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = wine)
## m2: lm(formula = as.numeric(quality) ~ alcohol + pH, data = wine)
## m3: lm(formula = as.numeric(quality) ~ alcohol + pH + sulphates,
## data = wine)
## m4: lm(formula = as.numeric(quality) ~ alcohol + pH + sulphates +
## density, data = wine)
## m5: lm(formula = as.numeric(quality) ~ alcohol + pH + sulphates +
## density + citric.acid, data = wine)
## m6: lm(formula = as.numeric(quality) ~ alcohol + pH + sulphates +
## density + citric.acid + volatile.acidity, data = wine)
##
## ========================================================================================================
## m1 m2 m3 m4 m5 m6
## --------------------------------------------------------------------------------------------------------
## (Intercept) -0.125 2.426*** 1.345*** 3.459 18.500 -13.771
## (0.175) (0.387) (0.401) (11.213) (12.040) (11.922)
## alcohol 0.361*** 0.386*** 0.367*** 0.365*** 0.337*** 0.342***
## (0.017) (0.017) (0.017) (0.019) (0.021) (0.020)
## pH -0.850*** -0.635*** -0.641*** -0.402** -0.478***
## (0.116) (0.116) (0.120) (0.139) (0.134)
## sulphates 0.868*** 0.872*** 0.811*** 0.656***
## (0.104) (0.106) (0.107) (0.104)
## density -2.088 -17.745 15.845
## (11.061) (11.971) (11.885)
## citric.acid 0.405*** -0.378**
## (0.121) (0.135)
## volatile.acidity -1.322***
## (0.116)
## --------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.252 0.283 0.283 0.288 0.342
## adj. R-squared 0.226 0.251 0.282 0.282 0.286 0.340
## sigma 0.710 0.699 0.684 0.685 0.682 0.656
## F 468.267 268.888 210.183 157.551 129.110 137.956
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1694.466 -1660.297 -1660.279 -1654.636 -1591.909
## Deviance 805.870 779.508 746.896 746.879 741.626 685.664
## AIC 3448.114 3396.931 3330.594 3332.558 3323.272 3199.818
## BIC 3464.245 3418.440 3357.480 3364.821 3360.912 3242.835
## N 1599 1599 1599 1599 1599 1599
## ========================================================================================================
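The nested-model comparison above can be sketched in base R by adding one predictor at a time and tracking R-squared. This is a hedged illustration on synthetic data: the response `score`, the data frame `toy`, and the generating coefficients are made up, and plain `summary()` and `AIC()` stand in for the formatted table shown above.

```r
set.seed(7)  # synthetic data; the generating coefficients are invented
n <- 300
alcohol          <- rnorm(n, 10.4, 1.0)
sulphates        <- rnorm(n, 0.66, 0.17)
volatile.acidity <- rnorm(n, 0.53, 0.18)
score <- 0.35 * alcohol + 0.7 * sulphates - 1.3 * volatile.acidity +
  rnorm(n, sd = 0.65)
toy <- data.frame(score, alcohol, sulphates, volatile.acidity)

# Build nested models by adding one predictor at a time.
m1 <- lm(score ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + sulphates)
m3 <- update(m2, . ~ . + volatile.acidity)

# R-squared never decreases as predictors are added to a nested OLS model.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
print(round(r2, 3))
print(AIC(m1, m2, m3))
```

Because R-squared can only go up as terms are added, adjusted R-squared, AIC, and BIC are the more useful rows to read when deciding whether a new predictor earns its place.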
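The multiple linear regression model above (m6) can explain approximately 34% of
the variance in wine quality using the following regression formula obtained from
the model:
WineQuality = -13.771 + 0.342(alcohol) - 0.478(pH) + 0.656(sulphates)
+ 15.845(density) - 0.378(citric.acid) - 1.322(volatile.acidity)
Note that because quality is an ordered factor with levels 3 through 8,
as.numeric(quality) returns the level index (1 to 6) rather than the original
score, so the value this formula produces is the quality score minus 2.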
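As a worked example of the regression formula, the sketch below plugs in the median values reported in the summary output (alcohol 10.20, pH 3.31, sulphates 0.62, density 0.9968, citric.acid 0.26, volatile.acidity 0.52) together with the m6 coefficients from the table. It assumes the 15.845 term belongs to density, as in the m6 column, and that the fitted response is the factor level index (`as.numeric()` on an ordered factor with levels 3 to 8 returns 1 to 6), so 2 is added to return to the original quality scale.

```r
# m6 coefficients as printed in the model table.
coefs <- c(intercept = -13.771, alcohol = 0.342, pH = -0.478,
           sulphates = 0.656, density = 15.845,
           citric.acid = -0.378, volatile.acidity = -1.322)

# Median values taken from the dataset summary output.
medians <- c(alcohol = 10.20, pH = 3.31, sulphates = 0.62,
             density = 0.9968, citric.acid = 0.26, volatile.acidity = 0.52)

# Linear predictor: intercept plus the sum of coefficient * value.
level_index <- coefs[["intercept"]] + sum(coefs[names(medians)] * medians)

# Assumption: the response is the level index, so shift by 2 for the 3-8 scale.
predicted_quality <- level_index + 2
print(round(c(level_index, predicted_quality), 2))  # about 3.55 and 5.55
```

A "median" wine therefore lands between quality 5 and 6, which is consistent with where most of the observations sit.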
##
## 3 4 5 6 7 8
## 0.63 3.31 42.59 39.90 12.45 1.13
This was the first plot I created, and one of the most surprising things to me
was the number of wines with a quality number of 5 or 6. Combined they make up
over 80% of the occurrences in the dataset. With such a high concentration of
occurrences in these two buckets, I assumed it would be difficult
to predict the qualities of a good wine.
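The percentages above follow directly from the quality counts in the dataset summary. A short sketch of the arithmetic, using those counts as a named vector (with the raw data one would start from `table()` on the quality column instead):

```r
# Quality counts as reported in the dataset summary.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of each quality level, in percent.
pct <- round(100 * counts / sum(counts), 2)
print(pct)

# Combined share of quality 5 and 6.
share_5_6 <- sum(counts[c("5", "6")]) / sum(counts) * 100
print(round(share_5_6, 2))
```

The combined share of quality 5 and 6 comes to roughly 82.5% of the 1599 wines.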
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
This plot was interesting because I just happened to stumble on it. I was
trying to see how many variables I could fit on one plot while still keeping it
readable. I was also looking to use the variable rating that I created, which
groups the quality ratings into three groups. I think this plot does a lot
without being too busy. It shows how percent alcohol, sulphates, and volatile
acidity combine in higher rated qualities of wine.
I took this plot a few steps further than the one I originally created at the
beginning of the analysis. Alcohol has the strongest correlation with quality of
all the variables. I added box plots to the distributions along with a summary
of the mean for each quality level, which emphasize the correlation between
alcohol and quality.
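The per-level means mentioned here can be computed with `tapply()`. This is a minimal sketch on invented values (the data frame `toy` is hypothetical), showing only the summary step rather than the plot itself.

```r
# Hypothetical toy data: two wines at each of three quality levels.
toy <- data.frame(
  quality = factor(c(5, 5, 6, 6, 7, 7), ordered = TRUE),
  alcohol = c(9.4, 9.8, 10.2, 10.8, 11.4, 11.8)
)

# Mean alcohol content within each quality level.
mean_by_quality <- tapply(toy$alcohol, toy$quality, mean)
print(mean_by_quality)
```

Overlaying these group means on the box plots makes the upward trend in alcohol across quality levels easier to see than the scatter alone.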
First I would like to address future work that could be done with this
dataset. I am sure there are a lot of other things that could be done with
this data, but to be more thorough I believe there should be a more
comprehensive dataset. More data on both ends of the spectrum, in
low quality and high quality wines, could help solidify any models created
with the data.
Even given the perceived lack of data, there was enough evidence to suggest a
noticeable correlation between quality and alcohol, as well as between quality and
sulphates, citric.acid, and volatile.acidity. The majority of my analysis
focused on these variables. I was able to come up with a model that can account
for 34% of the variance associated with the quality rating of this particular
red wine. My assumption is that there is a lot more that goes into producing
quality red wines than what I have explored here. I would also be willing to bet
that there are certain geographical and ecological variables that play a huge
part in wine making, not to mention the human element of grading wines.
It would be interesting to see a different dataset that takes on many other
variables than those presented here.
1. https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
(dataset description)
2. http://rmarkdown.rstudio.com/authoring_basics.html (R Markdown documentation)
3. https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/select
(dplyr - select function to remove columns)
4. https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/mutate
(dplyr - transform quality back to a number)
5. http://rapporter.github.io/pander/ (install pander to output the correlation
table)
6. https://www.rdocumentation.org/packages/memisc/versions/0.99.14.3/topics/mtable
(sdigits function for mtable)
7. https://stackoverflow.com/questions/40675778/center-plot-title-in-ggplot2
(center plot title)